The present application relates to information management technologies, and more particularly, to technologies for automated topic discovery in documents, term importance determination, automatic content categorization, content highlighting and summarization, information presentation, and document search and ranking.
Information overload is a common problem in the information age. Accurate and efficient information access, including collection, storage, organization, search, and retrieval, is key to success in this age.
Much of this information is contained in natural language contents, such as text documents. One particular challenge in information management is the efficient handling of so-called “unstructured data”. Usually, a document collection in its natural state is unorganized, that is, in an unstructured state. Examples of such documents include Web pages scattered over the Internet, documents within a company or other organization, and documents on personal computers.
Various theoretical and practical attempts have been made to organize natural language contents and to determine the amount and relevancy of the information they contain. Conventional techniques include search engines and document classification systems. In document search, information in the unstructured document data is accessed by sending queries to a search engine or index server, which returns the documents believed to be relevant to the query. One problem with using queries to access unknown data is that users often do not know what information is contained in the documents. Thus, users often cannot come up with the right keywords to effectively retrieve the most relevant information. Another problem is that conventional search engines cannot accurately determine the amount of information or the focus of information contained in a document, so the results they produce usually contain much irrelevant data. Often, time is wasted before the needed information is found.
There is still a need for technologies that can provide more efficient ways of finding the needed information among a large number of documents, and that can provide alternatives to conventional search for finding, organizing, and presenting such information.